Computer and Modernization ›› 2013, Vol. 1 ›› Issue (5): 22-27. doi: 10.3969/j.issn.1006-2475.2013.05.006

• Algorithm Design and Analysis •

Research on Data Skew Join Algorithm Based on MapReduce Model

JIN Jian (金健), CHEN Qun (陈群), ZHAO Bao-xue (赵保学)

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
  • Received: 2013-01-11 Online: 2013-05-28 Published: 2013-05-28

Abstract: Join algorithms based on MapReduce are an important topic in massive-data research. However, most existing optimization work assumes that the data are evenly distributed, whereas in practical applications the data are often skewed. This paper proposes Skew Control Join, a join algorithm based on the MapReduce programming model that is suited to severely skewed data. The algorithm obtains the overall distribution of the data set by sampling, then splits the data set with a total (global) partitioner so that the processing of skewed data is spread evenly across all Reduce tasks. Experimental results show that the algorithm performs well when the data are skewed.

Key words: join algorithm, data skew, total partition, sampling
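
The abstract outlines the approach only at a high level (sample the keys, then partition globally so that skewed keys are shared by all Reduce tasks). The following is a minimal single-process sketch of that general idea, not the authors' implementation. It assumes range partitioning on sampled quantile boundaries, in the style of total-order partitioning: a heavy key that occupies several boundaries is scattered over several reducers on the left (skewed) side, and the matching right-side rows are replicated to those reducers. All function names and parameters (`split_points`, `reducer_range`, `skew_join`, `sample_size`) are hypothetical.

```python
import bisect
import random
from collections import defaultdict


def split_points(sample, num_reducers):
    """Pick num_reducers - 1 quantile boundaries from a key sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def reducer_range(key, points):
    """Return the inclusive range (lo, hi) of reducers that may own `key`.

    A light key maps to a single reducer (lo == hi). A heavy key that
    appears as one or more boundaries maps to a span of reducers.
    """
    return bisect.bisect_left(points, key), bisect.bisect_right(points, key)


def skew_join(left, right, num_reducers, sample_size=1000, seed=0):
    """Equi-join two lists of (key, value) pairs; `left` is the skewed side.

    Returns the joined rows and the number of left rows per reducer.
    """
    rng = random.Random(seed)
    keys = [k for k, _ in left]
    sample = rng.sample(keys, min(sample_size, len(keys)))
    points = split_points(sample, num_reducers)

    # Simulated map phase: one (left_rows, right_rows) bucket per reducer.
    buckets = [([], []) for _ in range(num_reducers)]
    for k, v in left:
        lo, hi = reducer_range(k, points)
        # Scatter: each left row goes to exactly one reducer in the span.
        buckets[rng.randint(lo, hi)][0].append((k, v))
    for k, v in right:
        lo, hi = reducer_range(k, points)
        # Replicate: right rows go to every reducer in the span.
        for r in range(lo, hi + 1):
            buckets[r][1].append((k, v))

    # Simulated reduce phase: local hash join inside each bucket.
    joined, loads = [], []
    for left_rows, right_rows in buckets:
        index = defaultdict(list)
        for k, w in right_rows:
            index[k].append(w)
        loads.append(len(left_rows))
        for k, v in left_rows:
            for w in index.get(k, ()):
                joined.append((k, v, w))
    return joined, loads


if __name__ == "__main__":
    # Key 7 dominates the left input; a plain hash partitioner would send
    # all of its rows to a single reducer.
    left = [(7, i) for i in range(900)] + [(k, k) for k in range(100)]
    right = [(k, "r%d" % k) for k in range(100)]
    rows, loads = skew_join(left, right, num_reducers=4)
    print(len(rows), loads)
```

Because every left row is sent to exactly one reducer in its span and the matching right rows are present on all reducers of that span, each joined pair is produced exactly once. The cost of this design is the replication of right-side rows for heavy keys, which is why it suits the case where only one input is severely skewed.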
